Lab Assignment One: Exploring Table Data

Sian Xiao

1. Business Understanding

1.1 Data Source

This dataset was selected from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). It was originally published in Decision Support Systems in 2009 (https://doi.org/10.1016/j.dss.2009.05.016). It contains two datasets covering the red and white variants of the Portuguese "Vinho Verde" wine (a product unique to the Minho region in northwest Portugal), with 1599 red-wine and 4898 white-wine samples. The dataset can be used to predict wine quality from physicochemical indices.

1.2 Data Significance

According to the Oxford Dictionary, quality is "the standard of something as measured against other things of a similar kind; the degree of excellence of something". There are a few approaches to scoring wine: a discrete quality-level terminology (poor, acceptable, good, very good, outstanding); a five-star system; or a 100-point or 20-point scale (https://wineandotherstories.com/the-six-attributes-of-quality-in-wine/). The United States consumes the largest volume of wine of any country, at 33 million hectoliters in 2020 (https://www.statista.com/statistics/858743/global-wine-consumption-by-country/), so this is a large market, and predicting wine quality well matters commercially.

Quality is usually determined by professional, well-trained wine experts. In movies or TV programs we sometimes watch competitions where critics blind-taste wines, comment, and guess their origin. However, this is not feasible for the wine industry. First, the massive annual yield and consumption make it impossible to taste every bottle. Second, the majority of wine is of normal quality (the experts show up only for the excellent ones), so there is no need to taste each one. Third, expert judgments are subjective and hard to quantify: different experts give different scores, so there are no universal, decisive criteria, and the process relies mainly on human expertise. Also, taste is the least understood of the human senses (according to a citation in the original paper). Meanwhile, many physicochemical properties are easy and cheap to measure with modern techniques (as a PhD student in Chemistry I am confident about this), and these properties are related to wine quality. If we could use them as input to predict quality as output, the above concerns disappear: the process becomes easy and objective!

In conclusion, our task is to train a model on physicochemical properties to predict, or rather calculate, the quality of wines. Wine companies would be interested for two reasons: they can easily provide this data, and they want to rank their products by quality in order to price them.

1.3 Data Summary

As the UCI website notes, "Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.)." The donated dataset contains 12 attributes: 11 input variables based on physicochemical tests and 1 output variable based on sensory data. The type, scale, and range of each attribute are shown in the table below.

#   Attribute             Scale    Discrete/Continuous  Range
1   fixed acidity         ratio    continuous           3.8-15.9
2   volatile acidity      ratio    continuous           0.08-1.58
3   citric acid           ratio    continuous           0-1.66
4   residual sugar        ratio    continuous           0.6-65.8
5   chlorides             ratio    continuous           0.009-0.611
6   free sulfur dioxide   ratio    continuous           1-289
7   total sulfur dioxide  ratio    continuous           6-440
8   density               ratio    continuous           0.98711-1.03898
9   pH                    ratio    continuous           2.72-4.01
10  sulphates             ratio    continuous           0.22-2
11  alcohol               ratio    continuous           8-14.9
12  quality               ordinal  discrete             3-9

1.4 Measure of Success

This dataset can be framed as either a classification or a regression task.

If we treat it as a regression task: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the correlation coefficient R (and R², of course) are common metrics. The higher the correlation coefficient and the smaller the errors, the better the model performs. In this case, if we set a tolerance factor of 0.5, the predicted value should lie within 0.5 of the true value.
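As a minimal sketch, the metrics above can be computed with scikit-learn; the true/predicted quality values here are toy numbers, not output of a real model:

```python
# Sketch: regression metrics for predicted wine quality (toy numbers).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([5, 6, 7, 5, 6])          # hypothetical true quality scores
y_pred = np.array([5.3, 5.8, 6.4, 5.1, 6.6])  # hypothetical model predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Tolerance-based "accuracy": fraction of predictions within 0.5 of the true value.
within_tol = np.mean(np.abs(y_true - y_pred) <= 0.5)
```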

If we treat it as a classification task, we need to decide whether it is multi-class or binary. In multi-class classification, the predicted value should equal the true value (consistent with the previous tolerance factor of 0.5: since quality is a discrete value from 0 to 10, a deviation of 1 is already large). In binary classification, we label the quality as good/bad and the predicted category should match the true one. We can use a confusion matrix to visualize and evaluate classification performance. The commonly used indicators are precision, accuracy, recall (or sensitivity), and specificity.

                    Real positive        Real negative
Predicted positive  True positive (TP)   False positive (FP)
Predicted negative  False negative (FN)  True negative (TN)
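A minimal sketch of these indicators, using toy good/bad labels rather than the wine data:

```python
# Sketch: confusion-matrix metrics for a binary good/bad wine classifier (toy labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical true labels (1 = good)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)              # also called sensitivity
specificity = tn / (tn + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
```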

2. Data Understanding

The first thing to do is to import the libraries needed to explore tabular data.

2.1 Data Preparation

The task of this part is to load the dataset into two pd.DataFrames, named df_red_origin and df_white_origin.
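The UCI wine-quality files are semicolon-separated, so the load needs sep=";". A sketch follows; a tiny inline sample stands in for the real winequality-red.csv / winequality-white.csv files (assumed to be downloaded from the UCI page) so the snippet runs offline:

```python
# Sketch: loading the semicolon-separated wine-quality CSVs into DataFrames.
import io
import pandas as pd

# Two-row inline stand-in for the real file contents.
sample = '"fixed acidity";"volatile acidity";"quality"\n7.4;0.7;5\n7.8;0.88;5\n'
df_red_origin = pd.read_csv(io.StringIO(sample), sep=";")

# Actual loads, assuming the standard UCI file names sit in the working directory:
# df_red_origin = pd.read_csv("winequality-red.csv", sep=";")
# df_white_origin = pd.read_csv("winequality-white.csv", sep=";")
```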

2.2 Data Quality

First, we will print some basic information about the dataset and check whether there are any missing values.
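A sketch of the check on a small stand-in frame (the real check runs on df_red_origin and df_white_origin):

```python
# Sketch: basic info and missing-value count for a DataFrame.
import pandas as pd

df = pd.DataFrame({"alcohol": [9.4, 9.8, 10.0], "quality": [5, 5, 6]})  # stand-in
df.info()                                 # dtypes and non-null counts per column
n_missing = df.isnull().sum().sum()       # total missing cells (0 here, as for the wine data)
```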

Great! It looks like there are no missing values in the dataset (I'm lucky, considering that 90% of repositories in the UCI ML Repository reportedly have missing data), which is in line with the dataset's introduction. I don't want to randomly delete some data just to show off my imputation skills here.

But is it really true that we have no missing data? Are there implausible values that are effectively missing data in disguise? As far as I can tell, all the values are physically reasonable, so we treat the dataset as complete.

Since we don't need to impute missing data, let's check for duplicates.

Pandas detects many duplicate entries in the dataset. Are they true duplicates, and should we drop them? In my opinion we should, for the following reason. All 11 input variables (excluding quality) are of type float64, and the probability that all 11 values coincide by chance is vanishingly small. As a Chemistry major, I find it hard to believe that different experimental samples would yield identical measurements. These duplicated rows most likely come from human error, so let's drop them and move on.
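The detection and drop can be sketched like this (toy frame with one duplicated row standing in for the wine data):

```python
# Sketch: counting and dropping exact duplicate rows.
import pandas as pd

df = pd.DataFrame({"alcohol": [9.4, 9.4, 10.0], "quality": [5, 5, 6]})  # stand-in
n_dup = df.duplicated().sum()                  # rows identical to an earlier row
df = df.drop_duplicates().reset_index(drop=True)
```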

Now that we have dropped the duplicate entries, we can summarize the cleaned data.

Based on the table above, we might guess that many features contain outliers. We will discuss them later. For now, let's check the distribution of quality.

The quality of red wine ranges from 3 to 8, and that of white wine from 3 to 9. The majority lies in the range 5 to 7, which means the quality is imbalanced (i.e. there are many more normal wines than excellent or poor ones).

Let's convert the quality into qualified/unqualified using the classic exam-style Pass/Fail strategy (our favorite!). I will replace quality with eligibility: a wine with quality greater than or equal to 6 passes our exam and is marked as eligible (value 1 in eligibility); others are marked as 0. This turns our problem from multi-class classification or regression into binary classification. The results are stored in two new pd.DataFrames named df_red and df_white.
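The binarization step can be sketched as follows; the two-row df_red_origin here is a stand-in for the real frame:

```python
# Sketch: mapping quality to a binary eligibility flag at threshold 6.
import pandas as pd

df_red_origin = pd.DataFrame({"alcohol": [9.4, 11.2], "quality": [5, 7]})  # stand-in
df_red = df_red_origin.copy()
df_red["eligibility"] = (df_red["quality"] >= 6).astype(int)  # 1 = pass, 0 = fail
df_red = df_red.drop(columns="quality")
```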

For red wine, the eligibility is relatively balanced. But for white wine, the number of qualified instances is almost twice the number of unqualified ones. We might guess that a white wine has a higher probability of being qualified.

I renamed the features for brevity.

3. Data Visualization

Here is my thought about the two datasets: I want to combine them, so I first plot correlation heat maps to see whether the two wines show similar feature correlations within their own datasets.

Also, what correlations will we find, and how should we interpret them?

I have trouble distinguishing the default colors, so let's use seaborn to draw a heat map of the correlation data. The palette comes from https://vitalflux.com/correlation-heatmap-with-seaborn-pandas/ and looks better to me.
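A minimal heat-map sketch; the two-feature frame is a stand-in, and "coolwarm" is just one readable diverging palette (the report uses the one from the page above):

```python
# Sketch: correlation heat map with seaborn on a tiny stand-in frame.
import matplotlib
matplotlib.use("Agg")                 # headless backend so the sketch runs in a script
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"alcohol": [9.4, 10.1, 11.2],
                   "density": [0.998, 0.996, 0.994]})
corr = df.corr()
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.close(ax.figure)
```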

From my point of view, red and white wine share some similarity in feature correlation. Let's combine the two datasets and see the result. Before combining, we first add a column indicating the wine type; df is obtained from these two steps.
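The two steps can be sketched like this (one-row stand-in frames; the 1/0 type encoding is an assumption for illustration):

```python
# Sketch: tag each frame with its wine type, then stack them into one frame.
import pandas as pd

df_red = pd.DataFrame({"alcohol": [9.4], "eligibility": [0]})    # stand-in
df_white = pd.DataFrame({"alcohol": [10.8], "eligibility": [1]})  # stand-in
df_red["Type"] = 1     # hypothetical encoding: 1 = red
df_white["Type"] = 0   # 0 = white
df = pd.concat([df_red, df_white], ignore_index=True)
```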

We combine 1359 red wine instances and 3961 white wine instances to get 5320 wine instances. Let's draw the correlation heat map again to see if the correlation changes a lot.

Surprisingly, the correlations between some features changed a lot; some even changed sign (pooling two different populations can shift or flip correlations, as in Simpson's paradox). Let's draw a change heat map with the absolute differences rounded to one decimal place.
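The "change" matrix is just the element-wise absolute difference of the two correlation matrices. A sketch with tiny two-feature stand-ins:

```python
# Sketch: absolute difference between the two wines' correlation matrices,
# rounded to one decimal, ready to pass to sns.heatmap.
import pandas as pd

corr_red = pd.DataFrame([[1.0, -0.3], [-0.3, 1.0]],
                        columns=["FA", "VA"], index=["FA", "VA"])   # stand-in
corr_white = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]],
                          columns=["FA", "VA"], index=["FA", "VA"])  # stand-in
change = (corr_red - corr_white).abs().round(1)
# sns.heatmap(change, annot=True)   # visualize which pairs shifted most
```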

We can see that the correlations between Eligibility, Alcohol, and the other features barely changed, so they are stable across the two wines; alcohol is thus an important feature for both. On the contrary, the FA/VA, TSO2/VA, and TSO2/Sulphates correlations changed a lot. In my opinion, this suggests red and white wine differ in these respects, which may account for their different tastes (says someone who doesn't understand wine).

In this case, I wonder whether I should continue with the combined data. However, since this is not a classification assignment (in a model, the Type feature would carry that distinction), we can use visualization to show the differences. Awesome!

We can see that FSO2 and TSO2 are highly positively correlated; since free sulfur dioxide is part of total sulfur dioxide, this makes sense. Density and Alcohol are highly negatively correlated; since alcohol is less dense than water, the density of the mixture decreases as alcohol content rises, so this makes sense too. Let's delete FSO2 and Density and plot again.

We can see that VA is relatively strongly negatively correlated with Eligibility. Wine fermentation is a chemical process that turns raw materials into alcohol, but it can proceed further to acid, so a wine with high VA tastes like vinegar and has low quality. Chlorides relate to salt content (chloride is the anion of table salt), and a high salt content damages wine quality, hence its negative correlation with Eligibility.

In the three plots above, many features (such as FA, CA, RS, Chlorides, and Sulphates) have long tails, indicating outliers.

Most features share a similar distribution for qualified and unqualified wines. However, the distribution of alcohol is remarkably different. Recall that alcohol has a stable correlation with quality. Does higher alcohol content lead to better quality?

We can easily see that the eligibility rate increases with alcohol! Only three samples fall in (14, 15], so we ignore that bin. Our conclusion is that alcohol content is important.
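The per-bin eligibility rate can be sketched with pd.cut plus groupby; the alcohol values and bin edges here are toy stand-ins:

```python
# Sketch: eligibility rate per alcohol bin.
import pandas as pd

df = pd.DataFrame({"alcohol": [9.0, 9.5, 11.5, 12.5, 13.0],   # stand-in values
                   "eligibility": [0, 0, 1, 1, 1]})
bins = pd.cut(df["alcohol"], bins=[8, 10, 12, 14])             # hypothetical edges
rate = df.groupby(bins, observed=True)["eligibility"].mean()   # pass rate per bin
```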

Next, we can plot against red and white wine.

We can see that red and white wine differ a lot in VA and TSO2. These features might help predict the wine type. Can we predict whether a wine is red or white from the physicochemical data alone?

These features alone are not very good classifiers, so we can use principal component analysis (PCA) to transform the features into two principal components and plot them in two dimensions.
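A minimal PCA sketch with scikit-learn, standardizing first since the features have very different scales; random data stands in for the 11 physicochemical features:

```python
# Sketch: standardize, then project onto two principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 11))              # stand-in for the 11 input features
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
X_2d = PCA(n_components=2).fit_transform(X_std)
```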

4. Dimensionality Reduction

4.1 Dimensionality Reduction Introduction

Dimension reduction plays an important role in data science, being a fundamental technique in both visualization and as pre-processing for machine learning.

4.2 UMAP

UMAP is short for Uniform Manifold Approximation and Projection for Dimension Reduction. It is a novel manifold learning technique for dimension reduction, theoretically based on manifold theory and topological data analysis. (https://arxiv.org/pdf/1802.03426.pdf)

The dimensionality is reduced to 2 and the two dimensions are plotted above; we can see the two eligibility classes fall toward different ends. But the boundary is not that clear, which might come from uninformative original features.

https://wineandotherstories.com/the-six-attributes-of-quality-in-wine/ provides some criteria. "In white wines, acidity and alcohol/sugar should match; if the acidity is not enough compared to the wine’s sweetness level, the drink will appear cloying. For reds, tannins, acidity and alcohol should all be in balance." Let's try to use only alcohol and acidity as input.

Actually it is better, so the original features do influence the result. I know little about PCA, so let's study it later.

With the selected features, the classification capability increased a lot, and we get an unexpected bonus: we can use alcohol and acidity to classify the wine types! This suggests that red and white wine differ because they have different component contents, and thus may have different quality criteria. Maybe we should not combine red and white wine after all.

5. Discussion

We used the wine quality dataset here to practice data preparation, exploration, and visualization skills.

There are some small conclusions deduced from our work:

There are some unfinished thoughts (some are limited by skills):

Reference

UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Wine+Quality

When Cheap Wine Just Isn't Worth It https://www.thedailymeal.com/drink/when-cheap-wine-just-isnt-worth-it

Modeling wine preferences by data mining from physicochemical properties https://doi.org/10.1016/j.dss.2009.05.016

The six attributes of quality in wine https://wineandotherstories.com/the-six-attributes-of-quality-in-wine/

Correlation Concepts, Matrix & Heatmap using Seaborn https://vitalflux.com/correlation-heatmap-with-seaborn-pandas/

Wine consumption worldwide in 2020, by country (in million hectoliters) https://www.statista.com/statistics/858743/global-wine-consumption-by-country/

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction https://arxiv.org/pdf/1802.03426.pdf